Miles Sound System SDK 7.2a

Working with Voice Input

Discussion

Before you can compress and transmit voice data, you'll need to acquire it. We've provided an easy-to-use set of API functions for voice input (see the Digital Input Services chapter), which allow the application to control and monitor the recording of audio data from the user's microphone or other input source. MSSCHTC.CPP demonstrates how to open a device for digital input and obtain incoming data from it by means of a callback function. The speech-compression codecs are not hard-wired into MSS's audio input/output architecture -- other voice-input mechanisms may be used by the application, including calling DirectSoundCapture or the waveIn API directly for the lowest possible latency. However, three aspects of voice input should be kept in mind regardless of how the data is obtained.
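
By way of illustration, here is a minimal sketch of that callback-driven capture model, written against the standard Win32 waveIn API (one of the alternatives mentioned above) rather than the MSS Digital Input Services -- see MSSCHTC.CPP for the MSS-based version. The compress_and_send() routine is hypothetical and stands in for your application's codec and network layer.

   #include <windows.h>
   #include <mmsystem.h>            // link with winmm.lib

   // Hypothetical application routine: compress a block of recorded PCM
   // data and queue it for transmission (stands in for your codec and
   // network layer)
   //
   void compress_and_send(const void *pcm, DWORD bytes);

   // waveIn calls this whenever a submitted buffer has been filled with
   // recorded audio.  Per the waveIn documentation, avoid calling other
   // waveform functions from inside the callback -- return the buffer to
   // the driver with waveInAddBuffer() from another thread.
   //
   void CALLBACK input_callback(HWAVEIN   hwi,
                                UINT      uMsg,
                                DWORD_PTR dwInstance,
                                DWORD_PTR dwParam1,
                                DWORD_PTR dwParam2)
   {
      if (uMsg == WIM_DATA)
         {
         WAVEHDR *header = (WAVEHDR *) dwParam1;

         compress_and_send(header->lpData, header->dwBytesRecorded);
         }
   }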

Use 16-bit mono audio at 8 kHz. The Voxware codecs deliver their most efficient compression and best reproduction quality at a sampling rate of 8 kHz. This is the standard sample rate used in almost all commercial and military speech-communication applications, including the US telephone system. The Voxware ASI codec implementations are designed solely for use with 16-bit PCM audio at 8 kHz, and will not support any other format.
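
As a sketch only, the corresponding Win32 capture format and open call for 16-bit mono PCM at 8 kHz might look like the following (input_callback is the hypothetical callback from the previous sketch):

   WAVEFORMATEX format = { 0 };

   format.wFormatTag      = WAVE_FORMAT_PCM;
   format.nChannels       = 1;                 // mono
   format.nSamplesPerSec  = 8000;              // 8 kHz
   format.wBitsPerSample  = 16;                // 16-bit signed PCM
   format.nBlockAlign     = (WORD) (format.nChannels * (format.wBitsPerSample / 8));
   format.nAvgBytesPerSec = format.nSamplesPerSec * format.nBlockAlign;

   HWAVEIN input_device = NULL;

   MMRESULT result = waveInOpen(&input_device,
                                WAVE_MAPPER,
                                &format,
                                (DWORD_PTR) input_callback,
                                0,
                                CALLBACK_FUNCTION);

   // If the open succeeds, prepare and submit one or more WAVEHDR buffers
   // with waveInPrepareHeader() / waveInAddBuffer(), then call
   // waveInStart() to begin recording.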

Use full-duplex input. "Full duplex" simply means that the sound hardware is configured for simultaneous, independent recording and playback of audio data. With a full-duplex sound card, you don't have to stop playback in order to record speech, and vice versa. This doesn't mean that you have to leave recording on continuously, or that you have to continually send recorded audio data over the network - just that you don't have to shut down the playback subsystem in order to acquire input data. The Miles Sound System's integrated voice-input API supports only full-duplex audio hardware. Almost any sound card sold in the last five years will support full-duplex operation.

Choose voice-operated, continuous, or "push to talk" operation. Depending on the nature of your application (i.e., how "noisy" it is, how much network bandwidth is available, and under what circumstances your users will want to communicate), you'll need to decide how to control the flow of speech input.

Voice-operated (VOX) control may be appropriate in cases where the application doesn't generate a continuous stream of loud, raucous sound effects that would tend to cause false triggering. This can be implemented by leaving the input subsystem active all the time, and watching the incoming PCM input data buffers for an average signal level that exceeds a predetermined threshold, indicating that the user is speaking. Since you'll be dealing with 16-bit signed PCM data, you can simply examine the magnitude (absolute value) of individual samples or groups of samples in the buffer to determine their relative loudness. Typically it makes sense to average amplitude values over a period of time - several dozen milliseconds or more - in order to make sure the user has really begun speaking. VOX control may be an especially good match for applications that are bundled with telephone-operator-style headsets, or that otherwise require their use.
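
A minimal sketch of such a level check might look like the following; the threshold value and the averaging window are application-defined, and the function name is purely illustrative:

   #include <stdlib.h>      // abs()

   // Return nonzero if the average magnitude of a buffer of 16-bit signed
   // samples exceeds the trigger threshold.  The threshold value itself is
   // application-defined; requiring several consecutive "hot" buffers (a
   // few dozen milliseconds of audio) helps avoid triggering on brief
   // noises.
   //
   int vox_triggered(const short *samples, int n_samples, int threshold)
   {
      long total = 0;
      int  i;

      if (n_samples <= 0)
         return 0;

      for (i = 0; i < n_samples; i++)
         total += abs((int) samples[i]);

      return (total / n_samples) > threshold;
   }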

Continuous speech streaming may be useful in cases where network bandwidth is not particularly constrained, and/or your application supports only a few simultaneous users. This is the model implemented by the MSSCHTC and MSSCHTS example programs. It simply implies that the input API is left active on a constant basis, with all incoming data compressed and transmitted to the server for mixing and broadcasting.

Push-to-talk (PTT) offers a worthwhile compromise between the VOX and continuous operation models. The input subsystem may or may not be active on a continuous basis, but voice data is compressed and transmitted to the server only when the user holds down a particular key or input-device button. In noisy game environments with many simultaneous users, PTT is likely to be the mode of choice. In a bandwidth-saving variation on the PTT theme, each client could ask the server for "permission" to speak upon receiving a PTT input, allowing the server to limit the amount of incoming data from clients who are trying to speak at the same time.
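
A sketch of the basic PTT gate on a Win32 client might look like the following; the choice of talk key is arbitrary, and compress_and_send() is again the hypothetical routine from the earlier sketch:

   // Call this for each filled input buffer.  Audio is captured
   // continuously, but compressed and transmitted only while the user is
   // holding the talk key.  (VK_TAB is an arbitrary choice.)
   //
   void handle_input_buffer(const void *pcm, DWORD bytes)
   {
      if (GetAsyncKeyState(VK_TAB) & 0x8000)      // talk key held down?
         {
         compress_and_send(pcm, bytes);
         }

      // Otherwise simply discard the buffer, or feed it to your VOX logic
      // instead.
   }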

